Fault Tolerance and Resilience: Meanings, Measures and Assessment

Author

  • Lorenzo Strigini
Abstract

To assess in quantitative terms the “resilience” of systems, it is necessary to ask first what is meant by “resilience”, whether it is a single attribute or several, and which measure or measures appropriately characterise it. This chapter covers: the technical meanings that the word “resilience” has assumed, and its role in the debates about how best to achieve reliability, safety, etc.; the different possible measures for the attributes that the word designates, with their different pros and cons in terms of ease of empirical assessment and suitability for supporting prediction and decision making; the similarity between these concepts, measures and attached problems in various fields of engineering, and how lessons can be propagated between them.

Similar resources

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing that periodically ...

Application-Level Resilience Modeling for HPC Fault Tolerance

Understanding the application resilience in the presence of faults is critical to address the HPC resilience challenge. Currently we largely rely on random fault injection (RFI) to quantify the application resilience. However, RFI provides little information on how fault tolerance happens, and RFI results are often not deterministic due to its random nature. In this paper, we introduce a new meth...

Modelling Resilience of Data Processing Capabilities of CPS

Modern CPS should process large amounts of data with high speed and reliability. To ensure that the system can handle varying volumes of data, system designers usually rely on architectures with a dynamically scaling degree of parallelism. However, to guarantee resilience of data processing, we should also ensure system fault tolerance, i.e., integrate the mechanisms for dynamic reconf...

Using Performance Tools to Support Experiments in HPC Resilience

The high performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large scale computing platforms. This is driving enhancements in the programming environments, specifically research on enhancing message passing libraries to support fault tolerant computing capabilities. The community has also recognized that tools for resilience...

Restricted connectivity for three families of interconnection networks

Vertex connectivity and edge connectivity are two important parameters in interconnection networks. Even though they reflect the fault tolerance correctly, they undervalue the resilience of large networks. By the concept of conditional connectivity and super-connectivity, the concept of restricted vertex connectivity and restricted edge connectivity of graphs was proposed by Esfahanian [A.H. Es...


Publication date: 2012